Decision Trees and Ensemble Methods
DDD: Elements of Statistical Machine Learning & Politics of Data
Ayush Patel
At Azim Premji University, Bhopal
14 Feb, 2026
Did you come prepared?
You have installed R. If not, see this link.
You have installed RStudio/Positron/VS Code or any other IDE. It is recommended that you work through an IDE.
You have the libraries {tree}, {gbm}, {randomForest}, and {BART} installed.
Learning Goals
What are decision trees?
When to use them?
How do these work?
Application and interpretation.
Improving decision trees using ensemble methods.
Decision Trees
Supervised and non-parametric.
Can be used for both classification and regression.
The predictor space is cut into segments, and the mean response of the training observations in a segment is used as the estimated response for test observations that fall in it.
Simple and easy to interpret, but not the best for prediction accuracy on its own.
Prediction accuracy can be improved by ensemble methods.
Intuition - How it works
Can you Identify regions with similar salary range?
Model output - Predictor Space
from ISLR
Model output - Tree
from ISLR
Region representation
\(R_1 = \left\{X \mid Years < 4.5\right\}\) \(R_2 = \left\{X \mid Years \geq 4.5, Hits < 117.5\right\}\) \(R_3 = \left\{X \mid Years \geq 4.5, Hits \geq 117.5\right\}\)
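These regions can be reproduced directly in R. A minimal sketch, assuming the Hitters data from the {ISLR2} (or {ISLR}) package; the cutpoints are the ones shown above, and since ISLR models log(Salary), the leaf means here are only illustrative:

```r
library(ISLR2)

hitters <- na.omit(Hitters)

# assign each player to one of the three regions defined above
region <- with(hitters, ifelse(
  Years < 4.5, "R1",
  ifelse(Hits < 117.5, "R2", "R3")
))

# the tree's prediction for each region is the mean training response there
tapply(hitters$Salary, region, mean)
```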
Tree terminology
from ISLR
How should we carry out the segmenting of the predictor space?
Segmenting - Theory
The predictor space of \(p\) variables needs to be segmented into \(J\) distinct regions.
In theory, the regions can be of any shape; in practice, high-dimensional rectangles (boxes) are chosen for computational ease and interpretability.
For every observation that falls in region \(R_j\), we make the same prediction: the mean (regression) or mode (classification) of the training observations in \(R_j\).
The goal is to find the regions that minimize: \(\sum_{j=1}^{J}{\sum_{i \in R_j}{(y_i - \hat{y}_{R_j})^2}}\)
But
It is computationally infeasible to consider every possible partition: all possible cutpoints, for all possible predictors, in all possible sequences.
So, use Recursive Binary splitting
top-down greedy
top-down
We begin at the point where all observations are part of the same region. Hence the name top-down.
Greedy
We make the best split available at a particular step; we do not look ahead to future splits. A predictor \(X_j\) and a cutpoint \(s\) are chosen so that the resulting split leads to the lowest RSS.
This is carried out recursively, over and over again, within each resulting region.
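One greedy step can be sketched in a few lines of base R. This is an illustration of the idea for a single numeric predictor, not the {tree} implementation:

```r
# For a numeric predictor x and response y, find the cutpoint s that
# minimizes the RSS of the resulting two-region split.
best_split <- function(x, y) {
  cuts <- sort(unique(x))
  # candidate cutpoints halfway between consecutive observed values
  s_candidates <- (head(cuts, -1) + tail(cuts, -1)) / 2
  rss <- sapply(s_candidates, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cutpoint = s_candidates[which.min(rss)], rss = min(rss))
}

# toy data with a clear break between the first and last three observations
best_split(x = c(1, 2, 3, 8, 9, 10), y = c(10, 11, 12, 30, 31, 32))
```

With the toy data the chosen cutpoint is 5.5, separating the low-response and high-response groups. Recursive binary splitting applies this search to every predictor in every current region, splits on the winner, and repeats.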
Fitting Regression Trees
from {palmerpenguins} website
Fitting Regression Trees
library(tree)
library(palmerpenguins)

tree(body_mass_g ~ ., data = penguins) -> peng_mass_tree
peng_mass_tree
node), split, n, deviance, yval
* denotes terminal node
1) root 333 215300000 4207
2) species: Adelie,Chinstrap 214 40430000 3715
4) sex: female 107 8493000 3419 *
5) sex: male 107 13240000 4010 *
3) species: Gentoo 119 29670000 5092
6) sex: female 58 4519000 4680 *
7) sex: male 61 5884000 5485 *
Fitting Regression Trees
var n dev yval splits.cutleft splits.cutright
1 species 333 215259666 4207.057 :ab :c
2 sex 214 40428633 3714.720 :a :b
4 <leaf> 107 8493224 3419.159
5 <leaf> 107 13241192 4010.280
3 sex 119 29674443 5092.437 :a :b
6 <leaf> 58 4519321 4679.741
7 <leaf> 61 5884098 5484.836
Fitting Regression Trees
(Output abridged: this is apparently the printout of peng_mass_tree$where, which maps each of the 333 complete observations to the row of its terminal node in the tree frame; every value is 3, 4, 6, or 7.)
Fitting Regression Trees
Regression tree:
tree(formula = body_mass_g ~ ., data = penguins)
Variables actually used in tree construction:
[1] "species" "sex"
Number of terminal nodes: 4
Residual mean deviance: 97680 = 32140000 / 329
Distribution of residuals:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-760.30 -219.20 15.16 0.00 220.30 815.20
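The residual mean deviance in the summary is just the RSS divided by \(n - |T|\), here 333 observations minus 4 terminal nodes. A quick check, assuming peng_mass_tree was fitted as on the earlier slide:

```r
# residual mean deviance = RSS / (n - number of terminal nodes)
peng <- na.omit(penguins)
rss  <- sum((peng$body_mass_g - predict(peng_mass_tree))^2)
rss / (nrow(peng) - 4)   # should match the 97680 reported by summary()
```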
Fitting Regression Trees
plot(peng_mass_tree)
text(peng_mass_tree, pretty = 0)
Fitting Regression Trees
library(dplyr)

na.omit(penguins) |>
  mutate(
    pred_mass = predict(peng_mass_tree)
  ) |>
  relocate(body_mass_g, pred_mass, everything())
# A tibble: 333 × 9
body_mass_g pred_mass species island bill_length_mm bill_depth_mm
<int> <dbl> <fct> <fct> <dbl> <dbl>
1 3750 4010. Adelie Torgersen 39.1 18.7
2 3800 3419. Adelie Torgersen 39.5 17.4
3 3250 3419. Adelie Torgersen 40.3 18
4 3450 3419. Adelie Torgersen 36.7 19.3
5 3650 4010. Adelie Torgersen 39.3 20.6
6 3625 3419. Adelie Torgersen 38.9 17.8
7 4675 4010. Adelie Torgersen 39.2 19.6
8 3200 3419. Adelie Torgersen 41.1 17.6
9 3800 4010. Adelie Torgersen 38.6 21.2
10 4400 4010. Adelie Torgersen 34.6 21.1
# ℹ 323 more rows
# ℹ 3 more variables: flipper_length_mm <int>, sex <fct>, year <int>
Do it yourself
Using the penguins data, fit the linear model body_mass_g ~ species + sex. Compare the RSS of the linear model with the RSS of the decision tree.
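One possible solution sketch, assuming peng_mass_tree from the earlier slides is still in the session:

```r
library(palmerpenguins)

peng <- na.omit(penguins)
peng_lm <- lm(body_mass_g ~ species + sex, data = peng)

sum(residuals(peng_lm)^2)                            # RSS, linear model
sum((peng$body_mass_g - predict(peng_mass_tree))^2)  # RSS, decision tree
```

Since the tree also splits only on species and sex, the two models partition the data similarly, so their RSS values should be close.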
Can't just keep splitting
Some terminal nodes end up with very few observations:
node), split, n, deviance, yval
* denotes terminal node
1) root 263 53320000 535.9
2) CHits < 450 117 5931000 227.9
4) AtBat < 147 5 2940000 709.5 *
5) AtBat > 147 112 1779000 206.4
10) CRBI < 114.5 74 302100 141.8 *
11) CRBI > 114.5 38 567200 332.1 *
3) CHits > 450 146 27390000 782.8
6) Walks < 61 104 9470000 649.6
12) AtBat < 395.5 53 2859000 510.0 *
13) AtBat > 395.5 51 4504000 794.7
26) PutOuts < 771 45 2358000 746.4 *
27) PutOuts > 771 6 1255000 1157.0 *
7) Walks > 61 42 11500000 1113.0
14) RBI < 73.5 22 3148000 885.3
28) PutOuts < 239.5 7 1739000 1156.0 *
29) PutOuts > 239.5 15 656300 758.9 *
15) RBI > 73.5 20 5967000 1363.0
30) Years < 13.5 14 3767000 1521.0
60) CAtBat < 3814.5 8 529600 1141.0 *
61) CAtBat > 3814.5 6 541500 2028.0 *
31) Years > 13.5 6 1026000 992.5 *
Can't just keep splitting
So, when to stop?
The recursive splitting approach is likely to overfit, leading to a complex tree.
A relatively smaller tree can avoid overfitting via lower variance, providing better interpretation at the cost of a little bias.
One could split only if the reduction in RSS exceeds some threshold, but this is short-sighted: a seemingly worthless split early on may enable a very good split later.
So what to do
Grow the full tree, then prune it back to a smaller subtree.
But which subtree is best? Intuitively, the one that leads to the lowest test error rate.
Finding the best subtree
We can use cross-validation to estimate the test error rate, but there are far too many possible subtrees to evaluate them all.
Instead, we select a small sequence of subtrees using cost complexity pruning, also known as weakest link pruning.
Cost Complexity Pruning
\(\sum_{m=1}^{|T|}{\sum_{i:x_i \in R_m}{(y_i - \hat{y}_{R_m})^2}} + \alpha |T|\)
Here \(|T|\) is the number of terminal nodes of subtree \(T\), \(R_m\) is the region of the \(m\)th terminal node, and \(\alpha \geq 0\) is a tuning parameter: \(\alpha = 0\) returns the full tree, while larger values of \(\alpha\) penalize subtrees with many terminal nodes.
The Combined Algorithm
Use Recursive binary splitting to grow a large tree.
Apply cost complexity pruning to get a sequence of best subtrees, as a function of \(\alpha\)
Use K-fold CV to choose \(\alpha\):
Repeat steps 1 and 2 on all but the kth fold of the training data.
Evaluate the MSE on the left-out kth fold as a function of \(\alpha\). Average the results for each value of \(\alpha\) and choose the one that minimizes the average error.
Return the subtree from step 2 that corresponds to the chosen value of \(\alpha\).
Step 1 - Grow the full tree
rsample::initial_split(data = Hitters, prop = .75) -> hit_split
rsample::training(hit_split) -> train_tree
rsample::testing(hit_split) -> test_tree

tree(Salary ~ ., data = train_tree) -> sal_hit_tree
Step 2 - Apply Cost Complexity Pruning
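The sequence below is presumably produced by calling prune.tree() on the full tree with no k or best argument, which returns the whole cost-complexity sequence: subtree sizes, their deviances, and the k (that is, \(\alpha\)) values at which each subtree becomes optimal.

```r
# assumed call: prune.tree() with no k/best returns the full sequence
prune.tree(sal_hit_tree) -> pruned_sequence
pruned_sequence
```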
$size
[1] 12 11 10 9 8 6 5 4 3 2 1
$dev
[1] 10103044 10453308 10810982 11357152 11913936 13180950 13937113 14864006
[9] 16751384 20426590 34883922
$k
[1] -Inf 350264.5 357674.1 546169.4 556784.6 633507.1
[7] 756162.4 926893.5 1887377.9 3675205.6 14457332.2
$method
[1] "deviance"
attr(,"class")
[1] "prune" "tree.sequence"
Step 3 - Apply CV to choose \(\alpha\)
cv.tree(sal_hit_tree, FUN = prune.tree) -> cv_pruned_estimates
cv_pruned_estimates
$size
[1] 12 11 10 9 8 6 5 4 3 2 1
$dev
[1] 23039605 22865159 22865159 23120219 23136553 22185870 21241115 21035958
[9] 20490343 23071670 35050196
$k
[1] -Inf 350264.5 357674.1 546169.4 556784.6 633507.1
[7] 756162.4 926893.5 1887377.9 3675205.6 14457332.2
$method
[1] "deviance"
attr(,"class")
[1] "prune" "tree.sequence"
Step 3 - Apply CV to choose \(\alpha\)
tibble(
  leaves = cv_pruned_estimates$size,
  rss = cv_pruned_estimates$dev
) |>
  ggplot(aes(leaves, rss)) +
  geom_point() +
  geom_line() +
  theme_minimal()
Step 3 - Apply CV to choose \(\alpha\)
Step 4 - get the subtree from the large tree
prune.tree(sal_hit_tree, best = 3) -> best_sub_tree_sal

plot(best_sub_tree_sal)
text(best_sub_tree_sal, pretty = 0)
Let's predict using the model
predict(best_sub_tree_sal, test_tree)
-Andre Dawson -Argenis Salazar -Andres Thomas -Alan Wiggins
913.5872 228.4322 228.4322 541.5909
-Billy Beane -Buddy Bell -Bobby Bonilla -Brett Butler
228.4322 913.5872 228.4322 913.5872
(Output abridged: the pruned tree has three terminal nodes, so every test-set player receives one of just three predicted salaries: 228.4322, 541.5909, or 913.5872.)